test(e2e): run gpu workloads from manifest #1709

Open
elezar wants to merge 3 commits into main from feat/1472-gpu-validation-tests/elezar

elezar commented Jun 3, 2026

Member

Summary

This PR adds manifest-driven GPU workload execution tests on top of the workload image artifacts from #1484. It keeps the existing GPU device-selection coverage, adds workload execution coverage under the umbrella gpu target, and documents how to build workload images locally before running the GPU e2e suite.

This branch is now rebased on the local e2e stabilization fixes from #1935, so the Docker GPU test path also includes the supervisor-image and host SSH linker-environment fixes needed for local Nix/devenv runs.

Related Issue

Closes #1472

Changes

  • Switch GPU workload execution tests from a single image env var to a YAML workload manifest consumed by the Rust e2e harness.
  • Run the manifest-defined workloads through openshell sandbox create --gpu --from <image> -- <command> and enforce declared pass or fail expectations.
  • Load the local manifest from e2e/gpu/images/.build/workloads.yaml by default, with OPENSHELL_E2E_WORKLOAD_MANIFEST available for external manifests.
  • Update the Docker GPU e2e wrapper to point users at the workload manifest flow when no local manifest exists.
  • Add serde_yaml to the e2e crate for manifest parsing.
  • Include the local e2e fixes from #1935 (fix(docker): honor configured supervisor image): configured Docker supervisor image handling and host SSH linker-environment isolation.
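
The manifest lookup and the pass/fail enforcement described in the list above can be sketched in shell. This is only an illustration, not the harness itself, which is written in Rust and parses the manifest with serde_yaml. The helper names resolve_manifest, check_expectation, and run_workload, the OPENSHELL_BIN override, and the mapping of pass/fail onto the command's exit status are assumptions made for this sketch. The manifest schema is not shown in this description, so no YAML parsing is attempted here.

```shell
#!/usr/bin/env bash
# Sketch only: the real e2e harness is Rust; names below are hypothetical.

# Resolve the manifest path: the env override wins, else the local default.
resolve_manifest() {
  printf '%s\n' "${OPENSHELL_E2E_WORKLOAD_MANIFEST:-e2e/gpu/images/.build/workloads.yaml}"
}

# Compare an exit status with a declared expectation.
# $1 = "pass" or "fail" (assumed encoding), $2 = exit status of the workload.
check_expectation() {
  local expect="$1" status="$2"
  case "$expect" in
    pass) [ "$status" -eq 0 ] ;;
    fail) [ "$status" -ne 0 ] ;;
    *) echo "unknown expectation: $expect" >&2; return 2 ;;
  esac
}

# Run one workload and enforce its expectation.
# OPENSHELL_BIN is an assumption so the sketch can be dry-run with a stub.
run_workload() {
  local image="$1" expect="$2"
  shift 2
  local status=0
  "${OPENSHELL_BIN:-openshell}" sandbox create --gpu --from "$image" -- "$@" || status=$?
  check_expectation "$expect" "$status"
}
```

With a stub binary, `OPENSHELL_BIN=true run_workload some-image pass nvidia-smi` succeeds and `OPENSHELL_BIN=false run_workload some-image fail nvidia-smi` also succeeds, because each stub's exit status matches the declared expectation. The image name and command in those calls are placeholders.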

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Validation status:

  • mise run e2e:docker:gpu
  • mise run pre-commit was run after rebasing onto main; the Rust format/check/clippy, markdown lint, Python format, license, and docs checks completed successfully.
  • mise run pre-commit still fails overall at helm:lint because the local chart dependency directory is missing the postgresql dependency. This failure is unrelated to the GPU workload changes.

GPU validation commands for future runs:

  • mise run e2e:workloads:build
  • mise run e2e:docker:gpu

Notes:

  • Build workload images and generate the local manifest with mise run e2e:workloads:build before running mise run e2e:docker:gpu locally.
  • External catalogs can be exercised by setting OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/workloads.yaml.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

copy-pr-bot Bot commented Jun 3, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar force-pushed the feat/1476-gpu-workload-images/elezar branch from 5cc2d92 to efe4d25 Compare June 4, 2026 12:56
@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 5a84bca to 1c8f7b7 Compare June 4, 2026 14:13

copy-pr-bot Bot commented Jun 4, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@elezar elezar force-pushed the feat/1476-gpu-workload-images/elezar branch 2 times, most recently from de40d64 to 8426fac Compare June 10, 2026 20:54
Base automatically changed from feat/1476-gpu-workload-images/elezar to main June 15, 2026 18:26
@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 1c8f7b7 to c5182b1 Compare June 16, 2026 08:33
@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 032f133 to 55ed9ce Compare June 17, 2026 07:03
@elezar elezar marked this pull request as ready for review June 17, 2026 09:58
@elezar elezar added the test:e2e (Requires end-to-end coverage) and test:e2e-gpu (Requires GPU end-to-end coverage) labels Jun 17, 2026
@github-actions


Label test:e2e-gpu applied for 2f36b22. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@github-actions


Label test:e2e applied for 2f36b22. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 2f36b22 to 386d638 Compare June 17, 2026 12:35
@elezar elezar added the gator:blocked (Gator is blocked by process or repository gates) label Jun 23, 2026

elezar commented Jun 23, 2026

Member Author

gator-agent

Blocked

Gator is blocked by merge conflicts with the base branch. GitHub reports mergeable_state=dirty for head dbdc8134896c94f5f9babeae6fd8b5fb649494ab.

Next action: @elezar, please rebase or merge the base branch and resolve the conflicts, then push an updated head so gator can validate and review the PR.

@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch 2 times, most recently from 88e756c to 6c30486 Compare June 25, 2026 08:31

elezar commented Jun 25, 2026

Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 6c30486aaa9df6a1f63608cfea6f98266eca214e after the branch update since the prior merge-conflict blocker, and ran a bounded independent review of the current diff.

Disposition: partially resolved.

Remaining items:

  • Resolved: GitHub now reports mergeable=true and mergeable_state=blocked, so the previous dirty merge-conflict blocker is cleared.
  • Still unresolved: .github/workflows/e2e-gpu-test.yaml builds workload images without pinning the container engine. tasks/scripts/container-engine.sh auto-detects Podman first when present, so this Docker GPU workflow can build workload images into Podman storage and then run e2e:docker:gpu against Docker, where those image refs are unavailable. Please run the workload build with CONTAINER_ENGINE=docker, for example: CONTAINER_ENGINE=docker mise run --no-deps --skip-deps e2e:workloads:build.
  • Nonblocking suggestion: tasks/test.toml now runs the full gpu suite for Podman too; consider adding matching manifest setup guidance to e2e/rust/e2e-podman.sh so users do not accidentally run Podman GPU e2e with a Docker-built manifest.
  • OpenShell / Branch Checks and OpenShell / Helm Lint are passing. OpenShell / E2E and OpenShell / GPU E2E are still pending for this head.
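
To illustrate why pinning the engine matters, here is a minimal sketch of Podman-first detection with an explicit override. It is not the contents of tasks/scripts/container-engine.sh; the function name detect_container_engine and the fallback order after Podman are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical re-creation of Podman-first auto-detection with an override.
detect_container_engine() {
  if [ -n "${CONTAINER_ENGINE:-}" ]; then
    # An explicit setting always wins, so builds and tests share one engine.
    printf '%s\n' "$CONTAINER_ENGINE"
  elif command -v podman >/dev/null 2>&1; then
    # Without the override, Podman is picked whenever it is installed.
    printf 'podman\n'
  elif command -v docker >/dev/null 2>&1; then
    printf 'docker\n'
  else
    echo "no container engine found" >&2
    return 1
  fi
}
```

On a runner that has both engines installed, an unpinned call would resolve to podman under this sketch, while `CONTAINER_ENGINE=docker detect_container_engine` resolves to docker, which is the engine that e2e:docker:gpu runs against.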

Docs: GPU e2e documentation and gateway config reference updates are present; no docs navigation change appears necessary.

Next state: gator:in-review

@elezar elezar added the gator:in-review (Gator is reviewing or awaiting PR review feedback) label and removed the gator:blocked (Gator is blocked by process or repository gates) label Jun 25, 2026

elezar commented Jun 25, 2026

Member Author

gator-agent

Re-check After CI Update

I re-evaluated latest head 6c30486aaa9df6a1f63608cfea6f98266eca214e after the required gates completed.

Disposition: partially resolved, but not ready to leave review.

Remaining items:

  • Resolved: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, OpenShell / E2E, and OpenShell / GPU E2E are passing for this head.
  • Still unresolved: .github/workflows/e2e-gpu-test.yaml builds workload images without pinning the container engine. The workflow should run the workload build with CONTAINER_ENGINE=docker, for example CONTAINER_ENGINE=docker mise run --no-deps --skip-deps e2e:workloads:build, or explain why Podman-first auto-detection cannot affect this Docker GPU path.
  • Nonblocking suggestion remains: add matching manifest setup guidance to e2e/rust/e2e-podman.sh for Podman GPU e2e users.

Next state: gator:in-review

@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 6c30486 to 0e9f9fc Compare June 25, 2026 13:35

elezar commented Jun 25, 2026

Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 0e9f9fc605e36400cf32fbd9370e88d77019ec44 after the branch update since the prior gator:in-review feedback, and ran a bounded independent re-review of the current diff.

Disposition: resolved for review; CI is still in progress.

Remaining items:

  • Resolved: .github/workflows/e2e-gpu-test.yaml now sets CONTAINER_ENGINE=docker before building workload images, so the Docker GPU workflow builds workload images into the same engine used by e2e:docker:gpu.
  • No blocking gator review findings remain.
  • Nonblocking suggestions from the independent review: mirror the missing-manifest guidance in the Podman GPU wrapper, update the stale workflow comment path from e2e/rust/tests/gpu_device_selection.rs to e2e/rust/tests/gpu/device_selection.rs, refresh the TESTING.md Docker GPU description, and keep the serde_yaml/unsafe-libyaml dependency scoped to e2e-only use.
  • OpenShell / Branch Checks and OpenShell / Helm Lint are passing. OpenShell / E2E and OpenShell / GPU E2E are still pending for this head.

Docs: GPU e2e documentation is updated under e2e/gpu/README.md; no Fern docs or navigation change appears required for this internal test harness change.

Next state: gator:watch-pipeline

@elezar elezar added the gator:watch-pipeline (Gator is monitoring PR CI/CD status) label and removed the gator:in-review (Gator is reviewing or awaiting PR review feedback) label Jun 25, 2026

elezar commented Jun 25, 2026

Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: maintainer-authored, project-valid GPU E2E test-harness work linked to #1472.
Review: no blocking gator review findings remain. Earlier GPU workload image build feedback is resolved at head 0e9f9fc605e36400cf32fbd9370e88d77019ec44; remaining notes are nonblocking.
Docs: GPU E2E documentation is updated under e2e/gpu/README.md; no Fern docs or navigation update is required for this internal test harness change.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, OpenShell / E2E, and OpenShell / GPU E2E are passing for head 0e9f9fc605e36400cf32fbd9370e88d77019ec44.
E2E: test:e2e and test:e2e-gpu are applied and the required core and GPU E2E gates are passing.

Human maintainer approval or merge decision is now required.

@elezar elezar added the gator:approval-needed (Gator completed review; maintainer approval needed) label and removed the gator:watch-pipeline (Gator is monitoring PR CI/CD status) label Jun 25, 2026
@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch 3 times, most recently from 59572f2 to a351e3c Compare June 25, 2026 15:08
@elezar elezar added the gator:blocked (Gator is blocked by process or repository gates) label and removed the gator:approval-needed (Gator completed review; maintainer approval needed) label Jun 25, 2026

elezar commented Jun 25, 2026

Member Author

gator-agent

Blocked

Gator is blocked from completing the required independent re-review for current head a351e3c14b75a842ca08396e34e7c0fca47d2821 because the principal-engineer-reviewer sub-agent failed before producing a review result due to a Codex token refresh/authentication error.

The PR also has new CI in progress for this head, so it is not ready to stay in gator:approval-needed yet.

Next action: OpenShell sandbox operator should refresh or relaunch gator with a working reviewer sub-agent, then re-run gator so the current diff can be independently reviewed and the pending checks can be reconciled. No PR author action is requested by this blocker.

elezar commented Jun 25, 2026

Member Author

gator-agent

Re-check After Reviewer Update

I re-evaluated latest head a351e3c14b75a842ca08396e34e7c0fca47d2821 after the prior gator blocker, and the required principal-engineer-reviewer sub-agent completed successfully for the current diff.

Disposition: blocker resolved, but review follow-up is still needed.

Remaining items:

  • Resolved: the previous process blocker was only the failed reviewer sub-agent run; the independent review has now completed.
  • Still unresolved: tasks/scripts/e2e-gpu-build-images.sh now depends directly on GNU sha256sum for file and input hashing. OpenShell supports macOS developer workflows, and nearby project tasks use a sha256sum / shasum -a 256 fallback. Please add a small hash helper so local GPU workload image builds work on non-GNU userlands too.
  • Nonblocking suggestions: mirror the missing-manifest guidance in e2e/rust/e2e-podman.sh, refresh the TESTING.md Docker GPU description for manifest workload validation, and keep the deprecated serde_yaml / unsafe-libyaml dependency scoped to the e2e crate only.
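
A small hash helper along the lines requested above might look like the following sketch. The function name sha256_hex is hypothetical and this is not the code in tasks/scripts/e2e-gpu-build-images.sh; it only shows the sha256sum / shasum -a 256 fallback pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the SHA-256 hex digest of the given file(s),
# or of stdin when called without arguments.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    # GNU coreutils prints "<digest>  <name>"; keep the digest column only.
    sha256sum "$@" | awk '{print $1}'
  elif command -v shasum >/dev/null 2>&1; then
    # Perl-based shasum ships with macOS and uses the same output layout.
    shasum -a 256 "$@" | awk '{print $1}'
  else
    echo "sha256_hex: need sha256sum or shasum on PATH" >&2
    return 1
  fi
}
```

Usage would be `sha256_hex path/to/file` for file hashing or `printf '%s' "$inputs" | sha256_hex` for input hashing. Because the digest passes through a pipe, a caller that needs to detect hashing failures would have to enable `set -o pipefail`.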

Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, OpenShell / E2E, and OpenShell / GPU E2E are passing for this head.

Docs: GPU E2E documentation is updated under e2e/gpu/README.md; no Fern docs or navigation update appears required for this internal test harness change.

Next state: gator:in-review

@elezar elezar added the gator:in-review (Gator is reviewing or awaiting PR review feedback) label and removed the gator:blocked (Gator is blocked by process or repository gates) label Jun 25, 2026
elezar added 3 commits June 25, 2026 18:13
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from a351e3c to 87ea21e Compare June 25, 2026 16:27
Labels

gator:in-review (Gator is reviewing or awaiting PR review feedback)
test:e2e (Requires end-to-end coverage)
test:e2e-gpu (Requires GPU end-to-end coverage)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define GPU validation tests for GPU-enabled drivers

1 participant